Optimized recompilation using hardware tracing

ABSTRACT

A tracing controller may utilize a binary execution trace mechanism to trace execution of compiled application machine code. The tracing controller may initiate hardware tracing to gather control-flow hardware traces of a method executing on a processor configured to generate hardware tracing information. The controller may generate a profile based on the hardware tracing information and initiate re-compiling or re-optimizing of the method in response to determining that the new profile differs from the previous profile. The controller may repeatedly profile and re-optimize a method until profiles for the method stabilize. Profiling and hardware tracing of an application may be selectively enabled or disabled allowing the controller to respond to later phase changes in application execution by re-optimizing, thereby potentially improving overall application performance.

This application is a continuation of U.S. application Ser. No. 15/994,967, filed May 31, 2018, which claims benefit of priority to U.S. Provisional Application Ser. No. 62/650,812, filed Mar. 30, 2018, and which are incorporated herein by reference in their entirety.

BACKGROUND

Field of the Disclosure

This disclosure relates generally to optimizing software compilers, and more particularly to systems and methods for implementing optimized recompilation using hardware tracing.

Description of the Related Art

The gathering and analysis of processor traces has not generally been of practical utility in production software. Traditionally, applications are profiled using software instrumentation, with the compilation and optimization of code in the applications guided using profiles gathered during the execution of the application. However, software instrumentation generally slows the application down, and so is typically used only in the early, warmup phase of code execution. Additionally, hardware tracing has been used to improve performance in the context of static compilation. However, static compilation is performed by the developer and thus cannot respond to changes in application behavior at the end-user site.

There are two common techniques utilized to collect runtime profiles: (a) during interpretation of the application code before any machine code is generated, and (b) in the machine code generated by early compilation tier(s) in multi-tier compilation systems. The profiling in the interpreter takes place as each bytecode instruction is executed in the interpreter, adding to the overall execution time overhead of the interpreter. Moreover, every language interpreter must have its own profiler to feed the optimizing compiler, even though multiple languages can make use of a common compiler infrastructure.

Alternatively, or in addition to the profiling performed during interpretation, profiling can be performed through instrumentation of the machine code generated by an early compilation tier by emitting additional instrumentation into the machine code generated. For example, a tier 1 compiler may emit (or insert) instrumentation code into the machine code it generates so as to collect profiling information about events of interest. Augmenting the machine code with instrumentation comes at the cost of extra instructions to execute along with the original program's instructions, slowing down the overall execution.

Thus, profile collection through instrumentation is typically performed in the machine code generated by earlier compilation tiers but not in the final machine code generated to be run after profiling. Additionally, static compilation performed by developers cannot respond to changes in application behavior once the application has been deployed (e.g., at an end-user site).

One potential disadvantage of traditional profile collection mechanisms is that they are typically employed only at the start (i.e., during warmup) of a given program's execution because of their associated performance overheads. Once the necessary profiling information is collected, the compiler generates optimized machine code by utilizing profiling information collected during warmup. When using traditional techniques, there is typically no profiling performed after application warmup (unless the machine code encounters an unexpected scenario, deoptimizes, and falls back to interpretation until the next compilation). Thus, traditional techniques cannot detect the changes that might occur because of a phase change later in the execution of a given program (e.g., due to changes in data distribution) and hence cannot generate machine code optimized to adapt to the behavior of the program in a given period of time.

Another potential disadvantage of existing profiling mechanisms is a lack of context-sensitivity. Profiling information associated with an instruction in a method may not be differentiated based on how the program arrived at that point in the execution. Instead, collected profiles are averaged across all paths leading to that instruction. This may cause the compiler to miss optimization opportunities while generating machine code, which potentially results in sub-optimal performance.
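As merely one illustrative sketch of the averaging effect described above, the following Python fragment compares a profile keyed only by branch with a profile keyed by (call site, branch). The call-site names, the branch identifier and the sample counts are hypothetical values chosen for illustration and do not come from any particular embodiment.

```python
from collections import defaultdict

# Hypothetical observations of one branch "b1" inside a shared callee:
# (call_site, branch_id, taken). Values are invented for illustration.
observations = (
    [("callerA", "b1", True)] * 90 + [("callerA", "b1", False)] * 10
    + [("callerB", "b1", True)] * 10 + [("callerB", "b1", False)] * 90
)


def taken_ratio(samples):
    """Fraction of samples in which the branch was taken."""
    return sum(1 for taken in samples if taken) / len(samples)


def context_insensitive(obs):
    """Average each branch over every path that reaches it."""
    by_branch = defaultdict(list)
    for _site, branch, taken in obs:
        by_branch[branch].append(taken)
    return {branch: taken_ratio(v) for branch, v in by_branch.items()}


def context_sensitive(obs):
    """Keep a separate ratio for each (call site, branch) pair."""
    by_context = defaultdict(list)
    for site, branch, taken in obs:
        by_context[(site, branch)].append(taken)
    return {key: taken_ratio(v) for key, v in by_context.items()}


flat = context_insensitive(observations)
per_site = context_sensitive(observations)
print(flat)
print(per_site)
```

In this sketch the averaged profile reports the branch as taken half of the time, while the per-call-site profile shows it is strongly biased in opposite directions for the two callers, which is the kind of information a compiler might exploit when the callee is inlined.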

SUMMARY

Described herein are various methods, techniques and/or mechanisms for optimized recompilation using hardware tracing. Gathering and analyzing processor traces has been used in computer design and processor architecture research, but is generally not of practical utility in production software. A binary execution trace mechanism may allow software to gather control-flow traces at low overhead utilizing hardware tracing, according to various embodiments described herein.

As noted above, existing profiling techniques (e.g., software instrumentation) may degrade performance (e.g., slow down) of an executing application. However, by utilizing a binary execution trace mechanism configured to provide hardware tracing, an executing application program may be profiled at any time during execution. Additionally, an application may be profiled (and thus recompiled and/or reoptimized) while executing in a production environment (e.g., after being deployed at a customer/client site, when using real data, etc.). This may allow later phase changes (e.g., changes that occur during execution) to be detected and responded to, potentially improving overall performance.

To overcome the drawbacks of traditional profiling techniques, binary tracing of the compiled machine code may be employed for low-overhead and context-sensitive profiling, such as by using hardware tracing features available in modern processors. As one example, Intel's recent processors may be equipped with hardware support to provide application execution traces, such as Branch Trace Store, Last Branch Records, and Processor Trace. Although the encoding of the execution traces may differ among various tracing mechanisms, they all may provide information about the control flow of an executing binary, such as branch directions (i.e., branch taken or not taken), target addresses of direct calls, indirect calls and jumps. The profiles extracted from the hardware-provided execution traces may then be utilized by the compiler when performing optimizations based on these newly generated profiles.

One potential benefit of binary (e.g., machine code) tracing in hardware is that it shifts the instrumentation out of the thread under observation into an observer thread. Thus, performance of the observed thread may be minimally degraded according to some embodiments. Additionally, tracing information generated using the techniques described herein may be generated independent of any instrumentation instructions included in the application (and upon which other profiles may be based). In some embodiments, hardware tracing information may be processed, such as to extract and summarize the profiling information to be utilized by the compiler. Moreover, binary tracing may be turned on and off dynamically during application execution, so once the profiles stabilize tracing may be turned off at least temporarily and potentially turned on again if needed, something which static instrumentation is not generally capable of.

Another potential benefit of binary tracing is its ability to detect and adjust to later phase changes (e.g., changes to application execution, execution paths, etc.) without falling back to interpretation. Finally, because binary (e.g., machine code) tracing can be used to profile already compiled machine code, it may provide context-sensitive profiles of the methods that are already inlined in their caller method (if the compiler performed any inlining based on the previous profiles it was provided with during compilation of the method that is traced). Thus, there may be no extra implementation effort necessary to provide context sensitivity, according to some embodiments. Additionally, multiple interpreters using the same compiler infrastructure may utilize the profiling information extracted with binary tracing in hardware.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a logical block diagram illustrating a system configured to implement optimized recompilation using hardware tracing as described herein, according to one embodiment.

FIG. 2 is a logical block diagram illustrating a binary tracing control flow loop, as in one embodiment.

FIG. 3 is a flowchart illustrating one embodiment of a method for optimized recompilation of a method using hardware tracing.

FIG. 4 is a logical block diagram illustrating a compilation pipeline of an optimizing dynamic compiler, as in one embodiment.

FIG. 5 is a flowchart illustrating one embodiment of a method for profiling and optimizing an application using hardware tracing.

FIG. 6 is a logical block diagram illustrating the generation of a profile as part of optimized recompilation using hardware tracing, in one embodiment.

FIG. 7 is a logical block diagram illustrating a control flow for hardware tracing, in one embodiment.

FIGS. 8A and 8B are logical block diagrams illustrating differences between context insensitive and context sensitive profiling, in one embodiment.

FIG. 9 is a logical block diagram illustrating one environment suitable for implementing optimized recompilation using hardware tracing, in one embodiment.

FIG. 10 is a block diagram illustrating a computing system configured to implement the disclosed techniques, according to one embodiment.

While the disclosure is described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that the disclosure is not limited to embodiments or drawings described. It should be understood that the drawings and detailed description hereto are not intended to limit the disclosure to the particular form disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. Any headings used herein are for organizational purposes only and are not meant to limit the scope of the description or the claims. As used herein, the word “may” is used in a permissive sense (i.e., meaning having the potential to) rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to.

DETAILED DESCRIPTION OF EMBODIMENTS

This disclosure describes various methods, techniques and/or mechanisms for implementing optimized recompilation using hardware tracing, according to various embodiments. Modern execution environments, such as virtual machines (VMs) in some embodiments, typically rely on self-observation to examine their behavior and then utilize these observations to improve performance. Traditionally, execution environments may exploit two different observation approaches. First, they may note a specific event at a specific point in an executing program or application. Second, they may count occurrences of events across various regions of the program.

For example, conventional interrupt-driven profiling may gather call stack information at timer-driven interrupts, loop trip counts may trigger recompilation, and/or type histograms may be gathered by polymorphic inline caches. However, to obtain a higher level of performance it may be necessary to obtain additional information about an event, such as information describing how the program arrived at that point in the code. For example, optimizing a hot loop (e.g., a loop in executing source code that accounts for a relatively large amount of an application's overall execution time) containing complex control flow may involve identifying common paths and special-casing optimization of the paths (of each path in some embodiments). While a variety of ad-hoc techniques may be used to address specific situations or approaches, utilizing adaptive profiling and/or optimized recompilation using hardware tracing may, in some embodiments, provide a general technique usable in a wide variety of situations.

In the following detailed description, numerous specific details are set forth to provide a thorough understanding of claimed subject matter. However, it will be understood by those skilled in the art that claimed subject matter may be practiced without these specific details. In other instances, methods, apparatuses or systems that would be known by one of ordinary skill in the art are not described in detail below, so as not to obscure claimed subject matter.

While various embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit the embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the disclosure. Any headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to.

Some portions of the detailed description which follow are presented in terms of algorithms or symbolic representations of operations on binary digital signals stored within a memory of a specific apparatus or special purpose computing device or platform. In the context of this particular specification, the term specific apparatus or the like includes a general-purpose computer once it is programmed to perform particular functions pursuant to instructions from program software. Algorithmic descriptions or symbolic representations are examples of techniques used by those of ordinary skill in the signal processing or related arts to convey the substance of their work to others skilled in the art. An algorithm is here, and is generally, considered to be a self-consistent sequence of operations or similar signal processing leading to a desired result. In this context, operations or processing involve physical manipulation of physical quantities. Typically, although not necessarily, such quantities may take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared or otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to such signals as bits, data, values, elements, symbols, characters, terms, numbers, numerals or the like. It should be understood, however, that all of these or similar terms are to be associated with appropriate physical quantities and are merely convenient labels. Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining” or the like refer to actions or processes of a specific apparatus, such as a special purpose computer or a similar special purpose electronic computing device. In the context of this specification, therefore, a special purpose computer or a similar special purpose electronic computing device is capable of manipulating or transforming signals, typically represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the special purpose computer or similar special purpose electronic computing device.

In some embodiments, the techniques described herein may be implemented on a system including one or more processors that are configured to generate hardware tracing information for an application executing on the processor(s). FIG. 1 is a logical block diagram illustrating a system configured to implement optimized recompilation using hardware tracing, according to one embodiment. As illustrated in FIG. 1, system 100 may include one or more tracing capable processors 110. For instance, a processor 110 may be equipped with features to extract execution traces of executing application binaries. As merely one example embodiment, an Intel x86 processor may provide different binary execution trace mechanisms, such as Processor Trace (PT), Last Branch Records (LBR) and Branch Trace Store (BTS). Other processor manufacturers (e.g., AMD, ARM) may provide similar capabilities in their processors.

System 100 may also include an execution environment 120 configured to support the execution of one or more applications, programs and/or other software. For example, in some embodiments execution environment 120 may be a virtual machine (VM) and/or an adaptively-optimizing VM. Execution environment 120 may execute one or more service threads, such as for compilation, garbage collection, etc., as well as one or more application threads. In some embodiments, execution environment 120 may be considered a combination of a hardware tracing processor and an operating system configured to exploit that capability.

Executing within execution environment 120 may be one or more applications 130 including one or more methods 140. In some embodiments, application 130 and/or methods 140 may be compiled, optimized, traced, profiled, recompiled and/or reoptimized according to the techniques described herein. Compiler 160 may represent one or more compiler modules and/or any of various types of compilers, such as an optimizing dynamic compiler, a multi-tier compilation system, a just-in-time compiler, etc., according to various embodiments. Compiler 160 may be configured to compile and/or optimize source code for application 130 and/or methods 140.

System 100 may also include tracing controller 170, which may be configured to manage, direct, control and/or orchestrate hardware binary tracing performed on the machine code resulting from compilation of application 130 and/or methods 140. For example, in some embodiments tracing controller 170 may be configured to interact with compiler 160, processor(s) 110, decoder 180 and/or trace buffer 190 to collect hardware traces, generate profiles as well as initiate recompilation and/or re-optimization of individual ones of methods 140. Thus, in some embodiments tracing controller 170 may orchestrate tracing actions and pass new profiles to the compiler to be used for compilation.

In some embodiments, a binary execution trace mechanism, such as may be implemented by processors 110, may be configured to trace individual applications, operating system (OS) code, hypervisors and/or execution environments (e.g., VMs) running under a hypervisor. However, different binary execution trace mechanisms may differ in how they encode the traces, whether traces can be recorded continuously or sampled, and/or the relative performance overhead imposed on an application being traced. The techniques described herein may be implemented using various binary execution tracing mechanisms according to different embodiments. However, the techniques are described herein in terms of Processor Trace (PT) as one example and for brevity of explication. Processor Trace (PT) is one example of a binary execution trace mechanism of processor 110 that may be configured to gather control flow traces of individual methods 140 of application 130. When active for a particular hardware thread, a binary execution trace mechanism may write a trace of the method's execution into memory, such as into trace buffer 190. Subsequently, post-processing software, such as tracing controller 170 and/or decoder 180, may use the trace as input to reconstruct the control flow of that thread.

In some embodiments, a binary execution trace mechanism may emit tracing data at a high data rate, potentially overwhelming secondary storage subsystems. For instance, in one example embodiment, a binary execution trace mechanism may generate up to 500 MB of trace data per second, thereby potentially slowing down the traced program by 2-7%. In another example embodiment, Linux perf was utilized to gather a trace for a complete execution of the Richards benchmark on Truffle/JavaScript, obtaining a 1.6 GB trace in 5 s of execution. Thus, in some embodiments, tracing data may be held in memory, such as in trace buffer 190, and consumed from RAM, thereby potentially limiting the trace length (e.g., to some seconds of execution). Additionally, in some embodiments, trace-gathering may be limited (e.g., to methods of interest), such as to prevent overwhelming amounts of data being emitted.
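As a brief worked example of the data rates noted above, the following Python sketch derives the average rate of the 1.6 GB, 5 s trace and estimates how many seconds of execution an in-memory buffer could hold. The 1 GB buffer size and the function name are assumptions introduced only for illustration.

```python
MB = 10**6
GB = 10**9


def seconds_of_trace(buffer_bytes, trace_rate_bytes_per_s):
    """Approximate seconds of execution that fit in an in-memory trace buffer."""
    return buffer_bytes / trace_rate_bytes_per_s


# Average rate of the example trace: 1.6 GB gathered over 5 s of execution.
observed_rate = 1600 * MB // 5

# Hypothetical 1 GB buffer (an assumed size) at the 500 MB/s peak rate.
capacity_at_peak = seconds_of_trace(1 * GB, 500 * MB)

print(observed_rate // MB, "MB/s on average")
print(capacity_at_peak, "s of trace at the peak rate")
```

Under these assumed numbers the buffer would hold only about two seconds of execution at the peak rate, which is consistent with limiting trace-gathering to methods of interest.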

Thus, when hardware tracing an executing method, processor 110 may store tracing information in an encoded form, such as to reduce the amount of emitted data. For instance, in some embodiments, the direction of each branch (i.e., branch taken or not taken) encountered during execution may be encoded by a single bit. This potentially minimizes the data rate and/or overall data size required for tracing, but consequently may place a burden on decoding software.
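The one-bit-per-branch idea may be sketched as follows. This Python fragment is a simplified illustration only; it does not reproduce the packet format of Processor Trace or of any other real trace mechanism, and the function names and bit ordering are assumptions.

```python
def pack_branches(outcomes):
    """Pack branch outcomes (True = taken) into bytes, one bit per branch.

    Bit i of the stream is stored in byte i // 8 at bit position i % 8.
    This layout is an assumption for illustration, not a hardware format.
    """
    packed = bytearray((len(outcomes) + 7) // 8)
    for i, taken in enumerate(outcomes):
        if taken:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_branches(packed, count):
    """Recover the first `count` branch outcomes from a packed stream."""
    return [bool((packed[i // 8] >> (i % 8)) & 1) for i in range(count)]


outcomes = [True, False, True, True, False, False, False, True, True]
encoded = pack_branches(outcomes)
decoded = unpack_branches(encoded, len(outcomes))
print(len(outcomes), "branches stored in", len(encoded), "bytes")
```

The encoded stream alone does not say which branch each bit belongs to; a decoder has to pair the bits with the compiled machine code, which is the decoding burden mentioned above.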

Decoder 180 may be, or may include, one or more modules configured to decode tracing information (i.e., reconstruct the control flow, such as to extract profiles from the recorded traces) generated by a binary execution trace mechanism. In one embodiment, decoder 180 may represent a decoder library that is either part of, or separate from, tracing controller 170. Note that when tracing a thread, processor 110 may be instructed and/or configured to trace execution of a particular method 140 of application 130, as will be described in more detail subsequently.

Various types of data (i.e., tracing information) may be recorded in the trace and stored within trace buffer 190, such as outcomes of conditional branches, target addresses of indirect jumps and calls, sources and/or targets of asynchronous control transfers (e.g., exceptions, interrupts), as well as much additional detail, according to various embodiments. In some embodiments, a binary execution trace mechanism may avoid generating trace information that could otherwise be determined, such as by inspection, from the static binary (e.g., machine code). For example, execution of unconditional branches or direct calls may not be recorded in some embodiments. The trace data may be fed back into subsequent compilations, as profile(s) 150, when recompiling and/or reoptimizing one or more of methods 140 and/or application 130.
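One way to picture how a decoder might combine the recorded bits with the static binary is the following Python sketch. The toy control flow graph, its block names and the `reconstruct_path` helper are hypothetical; indirect jumps and asynchronous transfers, which would need recorded target addresses, are left out for brevity.

```python
# Hypothetical static "binary" as a toy control flow graph. Each block ends in:
#   ("cond", taken_target, fallthrough_target)  consumes one recorded bit
#   ("jump", target)                            statically known, nothing recorded
#   ("ret",)                                    end of the traced method
CFG = {
    "entry": ("cond", "loop", "exit"),
    "loop": ("cond", "hot", "cold"),
    "hot": ("jump", "entry"),
    "cold": ("jump", "entry"),
    "exit": ("ret",),
}


def reconstruct_path(cfg, start, branch_bits):
    """Replay the executed block sequence from taken/not-taken bits."""
    bits = iter(branch_bits)
    path = []
    block = start
    while True:
        path.append(block)
        terminator = cfg[block]
        if terminator[0] == "ret":
            return path
        if terminator[0] == "jump":
            # Unconditional transfer: read the target from the static code.
            block = terminator[1]
        else:
            # Conditional branch: the next recorded bit selects the successor.
            block = terminator[1] if next(bits) else terminator[2]


path = reconstruct_path(CFG, "entry", [1, 1, 1, 0, 0])
print(" -> ".join(path))
```

In this sketch only five bits are needed to recover an eight-block path, because the unconditional jumps are resolved from the static graph rather than from the trace.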

While described herein generally as compiling, tracing, profiling, optimizing, recompiling and/or reoptimizing individual methods 140 of application 130, the methods, mechanisms and/or techniques described herein may also be applied to virtually any suitable portion of code being executed on processor(s) 110, according to various embodiments. While illustrated in FIG. 1 as separate modules, in some embodiments one or more of compiler 160, tracing controller 170 and/or decoder 180 may be implemented in a combined fashion as a single module. Additionally, compiler 160, tracing controller 170 and/or decoder 180 are merely logical descriptions intended to aid in explanation and understanding of the methods, mechanisms and techniques described herein. In some embodiments, tracing controller 170 may include, or be part of, compiler 160. In other embodiments, compiler 160 may include, or be part of, tracing controller 170. In yet other embodiments, tracing controller 170 may be separate from, but may be configured to communicate and/or coordinate with, compiler 160 and/or decoder 180. Thus, features and/or techniques described herein as being performed by any of compiler 160, tracing controller 170 and/or decoder 180 may, in some embodiments, be performed by a different one of compiler 160, tracing controller 170 and/or decoder 180.

FIG. 2 is a logical block diagram illustrating a binary tracing controller loop, according to one embodiment. When implementing optimized recompilation using hardware tracing, as described herein, tracing controller 170 (and/or compiler 160) may be configured to repeatedly select one or more of methods 140, as shown in block 200, enable hardware tracing of the selected methods, as in block 210, and generate profiles by post-processing the gathered trace information, as in block 230, according to one embodiment. Thus, tracing controller 170 may profile application 130 in an iterative manner.

Tracing controller 170 may utilize any of various manners to select which methods of an application to trace and profile, according to various embodiments. For example, in some embodiments tracing controller 170 and/or compiler 160 may be configured to select which methods of application 130 to trace and/or in what order to trace those methods based on each method's respective contribution to the total execution time (or the total instruction count) of the application, as will be described in more detail herein.

In some embodiments, a controller loop (and/or tracing controller 170) implementing the techniques described herein may be executing in a stand-alone virtual machine thread (e.g., within execution environment 120). The controller loop may start when the application starts and may terminate at, or before, application termination. For example, in some embodiments the controller loop may end before the application terminates if the methods being profiled stabilize before the application terminates, as will be described in more detail subsequently.

FIG. 3 is a flowchart illustrating one embodiment of a method for optimized recompilation of a method using hardware tracing. As illustrated by block 310, tracing controller 170 may initiate non-instrumentation based hardware tracing of a method of an application executing within execution environment 120. For instance, the application may be executing on a system (e.g., a combination of a hardware tracing processor and an operating system configured to exploit that capability) which is configured to provide one or more binary execution trace mechanisms.

The application and/or the method may have been previously compiled and/or optimized, such as based on static optimization mechanisms and/or software instrumentation of the application and/or method. Thus, the hardware tracing may be performed on compiled (and possibly optimized) machine code of the method. Additionally, the method may be one of multiple methods being traced simultaneously.

The tracing controller 170 may instruct one or more processors (on which the application is executing) to initiate hardware tracing of the method 140, which may be referred to herein as the traced method. In response, the processor(s) may extract execution traces of the method while the application is executing. Various types of data may be recorded in the trace, such as outcomes of conditional branches, target addresses of indirect jumps and calls, as well as sources and targets of asynchronous control transfers (e.g., exceptions, interrupts), as will be described in more detail subsequently. The hardware based tracing may not include, nor rely on, any specific instrumentation of the application (e.g., software instrumentation), in some embodiments, and may occur while the application is live, that is, executing in a real, not a test or simulated, execution environment, possibly using real data. In other words, instead of being profiled by the developer in a test or simulation environment, the application may be profiled while it is executing at the customer/client site using real customer/client data (e.g., in production).

As shown by block 320, tracing controller 170 may generate a profile for the method based at least in part on a control flow trace of the method created by the hardware tracing of the method. In some embodiments, generating a profile may involve decoding the trace data (e.g., reconstructing the control flow) from the hardware tracing, such as by decoder 180. Additionally, tracing controller 170 may map the profile details and/or other trace information to the application's source code, based on source code positions propagated through the compilation pipeline and associated with their corresponding instructions in the machine code being executed.

If, as indicated by the positive output of decision block 330, the newly generated profile differs from the previous profile, the tracing controller 170 may initiate a recompiling and/or reoptimizing of the method, as shown in block 340. In some embodiments, the tracing controller 170 may compare the profile with a previously generated profile for the method. For instance, the previously generated profile might be provided by an earlier stage in the compilation pipeline or an earlier hardware-based profiling session to be used for the compilation of the method of interest. The threshold used to determine whether two profiles differ enough to warrant recompiling and/or reoptimizing may be configurable and may vary from embodiment to embodiment. Thus, if the new profile differs more than the threshold amount from the previously generated profile, the tracing controller 170 may recompile/reoptimize the method and may then proceed to profile the method again, as indicated by the arrow from block 340 to block 320.
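As merely one hypothetical way to realize the comparison at decision block 330, the following Python sketch treats a profile as a mapping from branch identifiers to taken-probabilities and uses the largest per-branch change as the difference measure. The profile representation, the metric and the default threshold of 0.1 are all assumptions; the disclosure leaves the threshold configurable and does not prescribe a metric.

```python
def profile_distance(old, new):
    """Largest absolute change in taken-probability across all branches.

    A branch missing from one profile is treated as probability 0.0,
    which is an assumption made only for this sketch.
    """
    branches = set(old) | set(new)
    return max(
        (abs(old.get(b, 0.0) - new.get(b, 0.0)) for b in branches),
        default=0.0,
    )


def should_recompile(old, new, threshold=0.1):
    """Return True when the profiles differ by more than the threshold."""
    return profile_distance(old, new) > threshold


previous = {"b1": 0.5, "b2": 0.75}
shifted = {"b1": 0.25, "b2": 0.75}   # phase change on branch b1
print(should_recompile(previous, shifted))
print(should_recompile(previous, dict(previous)))
```

Other measures, such as a weighted sum over branches scaled by execution counts, would fit the same decision point; the maximum is used here only because it is easy to reason about.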

Alternatively, if the newly generated profile does not differ (e.g., more than the threshold amount) from the previously generated profile, as indicated by the negative output of decision block 330, tracing controller 170 may mark the method as stable, as shown in block 350. For example, the tracing controller 170 may maintain a table of methods to be profiled and may indicate within the table that the method is stable (e.g., its profile does not differ significantly from the previously generated profile). The tracing controller 170 may at least temporarily discontinue profiling/tracing a method that has been marked as stable (or may instruct the processor to discontinue profiling/tracing the method), but may again profile/trace a method once marked as stable if conditions warrant, such as after a certain amount of execution time has passed since the method was last profiled/traced, if performance of the application/method changes significantly, etc.

Tracing controller 170 may then continue profiling the application. For instance, if tracing controller 170 marked the method as stable, the tracing controller 170 may continue to profile/trace the application without profiling/tracing the same method. If, however, tracing controller 170 initiated a recompilation/reoptimization of the method, the tracing controller 170 may continue to profile/trace the application including profiling/tracing the same method again.
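The flow of blocks 310 through 350 may be summarized by the following Python sketch. The callables passed in stand for the tracing, decoding, comparison and compilation machinery and are hypothetical placeholders, as are the dictionary used to represent a method and the `max_rounds` safety limit; none of these names come from the disclosure.

```python
def optimize_until_stable(method, trace_and_profile, recompile, differs, max_rounds=10):
    """Repeatedly profile and recompile one method until its profile stabilizes.

    `max_rounds` is an assumed safety limit so the sketch always terminates.
    """
    previous = method["profile"]
    for _ in range(max_rounds):
        current = trace_and_profile(method)   # blocks 310 and 320
        if not differs(previous, current):    # decision block 330
            method["stable"] = True           # block 350
            return method
        recompile(method, current)            # block 340
        method["profile"] = current
        previous = current
    return method


# Simulated run: each hardware profiling session yields the next value.
simulated_profiles = iter([0.5, 0.8, 0.82])
recompilations = []
method = {"name": "hotLoop", "profile": 0.2, "stable": False}

result = optimize_until_stable(
    method,
    trace_and_profile=lambda m: next(simulated_profiles),
    recompile=lambda m, profile: recompilations.append(profile),
    differs=lambda old, new: abs(old - new) > 0.1,
)
print(result["stable"], recompilations)
```

In the simulated run the method is recompiled twice and is marked stable on the third profiling session, when the new profile is within the assumed threshold of the previous one.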

In some embodiments, binary hardware tracing of the optimized machine code generated by the last tier in a just-in-time compilation pipeline may be performed utilizing hardware features, as depicted in FIG. 4. FIG. 4 is a logical block diagram illustrating a compilation pipeline of an optimizing dynamic compiler, as in one embodiment. A compilation pipeline may include various logical stages, such as interpretation 410, tier 1 compilation 430, and tier 2 compilation 460. FIG. 4 also illustrates the control flow between the various stages as well as some of the results of the stages. For example, interpretation 410 may result in one or more interpreter profiles 420, tier 1 compilation may generate machine code 440, which may be used to generate one or more instrumentation profiles 450, and tier 2 compilation may generate optimized machine code 480, potentially iteratively using tracing profiles 470.

Thus, application threads may not always run optimized machine code. For example, while the application is warming up (e.g., during the start of the application's execution) or when the machine code de-optimizes (such as due to a failed speculation), interpreted and/or less optimized machine code, such as machine code 440, may be executed. Moreover, the machine code might not include all the control-flow paths compiled, thereby possibly necessitating a fallback (i.e., uncommon branch trap) to the interpreter or less-optimized machine code, according to some embodiments.

FIG. 5 is a flowchart illustrating one embodiment of a method for profiling/optimizing an application using optimized recompilation using hardware tracing as described herein. As shown in block 510, tracing controller 170 may initiate hardware based profiling of an executing application 130. For instance, application 130 may be executing on a processor 110 of system 100 configured to provide one or more binary execution trace mechanisms. Tracing controller 170 may initiate hardware tracing in any of various manners, such as by executing one or more hardware instructions of processor 110 and/or loading one or more registers (or addresses) within processor 110 with relevant information regarding an application, thread(s), method(s) and/or address range(s) to be traced. In general, the particular manner and/or method used to initiate or begin hardware tracing may vary from embodiment to embodiment.

As shown in block 520, tracing controller 170 may determine a method of the application to profile. For example, in one embodiment the tracing controller may be configured to rank some or all of the methods of the application based on their respective contribution to the total execution time of the application and trace the method(s) that contribute the most to the overall execution time before profiling methods that contribute less to the total execution time. As described herein, methods that contribute significantly (e.g., relatively more than other methods) to the overall execution time may be considered ‘hot’ methods. Because any optimization performed for the hottest methods is likely to contribute more to overall performance improvements, tracing controller 170 may trace (some number of) the hottest methods first. For example, in one embodiment, the controller may trace the top F hottest methods until one or more of those methods are marked as stable. After enough profiling data are collected and the controller determines that a method is stable (e.g., based on some criteria as will be described subsequently), the controller may stop tracing stable methods and begin to trace less hot methods (e.g., according to a sorted hot methods list).

Tracing controller 170 may then profile the determined method (e.g., one of the hot methods) based on hardware tracing of the executing method, as in block 530. For instance, the compiler may utilize one or more binary execution trace mechanisms of the processor(s) on which the application is executing. The binary execution trace mechanism may generate a trace enabling the tracing controller (or other post-processing software) to reconstruct the control flow of the method (i.e., decode the trace), including various types of data, such as outcomes of conditional branches, target addresses of indirect jumps and calls, as well as sources and targets of asynchronous control transfers (e.g., exceptions, interrupts), according to various embodiments. The exact data to be recorded in the trace may be determined by the processor, may be configurable (e.g., by the tracing controller, an administrator and/or user) and/or may vary from embodiment to embodiment.

If tracing controller 170 determines that the method is stable, as will be described in more detail subsequently, as indicated by the positive output of decision block 540, tracing controller 170 may determine another method to profile if there are additional methods to be profiled, as shown by block 580 and the negative output of decision block 570. If, however, tracing controller 170 determines that the method is not stable, as indicated by the negative output of decision block 540, tracing controller 170 may initiate a recompilation and/or re-optimization of the method and then continue profiling the method as shown in blocks 550 and 560.

Thus, tracing controller 170 may continue to profile and/or optimize the application until all methods (or at least all relevant methods) have been profiled and/or are determined to be stable. For instance, tracing controller 170 may only profile/optimize the N hottest methods of the application in one embodiment. In other embodiments, tracing controller 170 may profile/optimize all methods in application 130 or may profile/optimize for a certain amount of time, profiling/optimizing as many methods 140 as can be done during that time, etc.

As noted above, reducing the amount of data emitted by hardware tracing may minimize the data rate, but consequently may place a burden on decoding software. In some embodiments, a decoder library, such as decoder 180, may be utilized to decode traces. Decoder 180 may be configured to decode traces at a variety of levels: simple decoding of the emitted packets by type; basic blocks; and/or complete control flow and instruction decoding. Decoding at a higher level of detail generally slows decoding. For instance, full decoding may be a thousand times slower than the original execution, in some embodiments. Decoding at the basic block level may be sufficient for the techniques described herein according to some embodiments. For instance, embodiments that utilize (or rely on) branch direction and target profiles extracted from binary execution traces may utilize basic block level decoding.

While some decoder libraries may be implemented in a relatively naive manner, in some embodiments, higher-performing decoders may be used, potentially at the cost of additional complexity. For example, a trace may be easily segmented and individual segments processed in parallel. Additionally, abstracted traces may be generated from binaries which can be compared for divergence from expected behavior at lower performance cost than full decoding, according to some embodiments.

At a high optimization level, an optimizing compiler, such as compiler 160, may no longer emit instrumentation code, and so an execution environment (such as a VM) may have lower visibility into the behavior of optimized code. Low-overhead control flow traces may have several uses, such as: to serve as a source of branch behavior; to determine the relative taken/not-taken probabilities of branches; to build a histogram of targets of indirect calls, jumps and returns; to obtain path profiles; and to augment event information with context (e.g., what preceding behavior led to an uncommon branch or trap and subsequent deoptimization).

Additionally, in some embodiments, a binary execution trace mechanism may be combined with other hardware performance observation techniques (e.g., counters, instruction sampling, etc.). Such a combination may improve data obtained from these other mechanisms, according to some embodiments.

In some embodiments, optimized recompilation using hardware tracing may involve tracing selectively. For example, tracing controller 170 may be configured to select between tracing the content of a single execution loop or the entirety of a method. In some embodiments, tracing controller 170 may take advantage of address filters, which may limit tracing to stated address ranges, of binary execution trace mechanisms of processor 110. For instance, a binary execution trace mechanism (and/or a processor providing one or more binary execution trace mechanisms) may provide for only a limited number of address filters, which may limit tracing to certain address ranges that can further reduce the generated trace data rate. In some embodiments, only two filters may be supported. In some embodiments, controller 170 may be configured to time-multiplex the supported address filters among the methods to be traced so that a higher number of methods may be traced than the number of supported address filters.
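As a non-limiting illustration of time-multiplexing address filters, the following Java sketch rotates a small number of filters through a longer list of method address ranges, one group per time slice. The class FilterMultiplexer, the nested Range type, and the round-robin policy are assumptions made for this example; the sketch only computes which ranges would be programmed into the filters and does not interact with any processor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical round-robin scheduler for a limited number of address filters.
class FilterMultiplexer {
    // Address range [start, end) of one compiled method (illustrative type).
    static final class Range {
        final String method;
        final long start;
        final long end;

        Range(String method, long start, long end) {
            this.method = method;
            this.start = start;
            this.end = end;
        }
    }

    private final List<Range> methods;
    private final int filterCount;
    private int next = 0; // index of the next method to receive a filter

    FilterMultiplexer(List<Range> methods, int filterCount) {
        this.methods = new ArrayList<>(methods);
        this.filterCount = filterCount;
    }

    // Returns the ranges to load into the filters for the next time slice,
    // continuing round-robin from where the previous slice stopped.
    List<Range> nextSlice() {
        List<Range> slice = new ArrayList<>();
        int n = Math.min(filterCount, methods.size());
        for (int i = 0; i < n; i++) {
            slice.add(methods.get(next));
            next = (next + 1) % methods.size();
        }
        return slice;
    }
}
```

With three methods A, B and C and two filters, successive slices in this sketch would cover A and B, then C and A, then B and C, so that every method is eventually traced.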

When profiling an application, performing binary tracing for all instructions executed by the application may result in wasted resources, such as due to tracing and/or post-processing of traces. To avoid resource waste, in some embodiments tracing controller 170 and/or compiler 160 may determine one or more application methods that may contribute relatively more than other methods to the total execution time of the application. The determined methods may then be selected for tracing, profiling, recompiling and/or reoptimizing, according to various embodiments. In other embodiments, methods may be selected based on other criteria, such as their relative contribution to the overall instruction count of the application.

To determine which methods to trace, the tracing controller may maintain address-range metadata for methods compiled by the final compilation tier. For example, in one embodiment, the controller may maintain a map where each key represents (or identifies) the address range of a method. The controller may employ sampling-based hardware performance counters to obtain the addresses of retired instructions (e.g., instructions that were actually executed as part of the application flow as opposed to instructions that may be speculatively executed). As the retired instruction addresses are gathered, the controller may perform lookups in the address-range map and upon a match may increment the counter associated with the matching method. Thus, the controller may in some embodiments build a histogram of compiled methods' respective contribution to the total instruction count of the application. After sorting the histogram, the controller may then identify the top N methods to trace, which may be considered the ‘hottest’ methods. Similarly, a histogram representing methods' respective contribution to the overall execution time of the application may also be generated and used to determine an order in which methods may be profiled.
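As a non-limiting illustration of the address-range lookup and histogram described above, the following Java sketch keys a sorted map by the start address of each compiled method and counts sampled instruction addresses per method. The class HotMethodHistogram and its method names are assumptions introduced for this example, as is the assumption that method address ranges do not overlap; the sampled addresses would, in practice, come from hardware performance counters rather than from the caller.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical histogram of samples per compiled method.
class HotMethodHistogram {
    // Describes one compiled method; ranges are assumed not to overlap.
    private static final class MethodRange {
        final long end; // exclusive end address
        final String name;

        MethodRange(long end, String name) {
            this.end = end;
            this.name = name;
        }
    }

    // Start address -> method range, sorted so floorEntry finds a candidate.
    private final TreeMap<Long, MethodRange> ranges = new TreeMap<>();
    private final Map<String, Long> counts = new HashMap<>();

    void addMethod(String name, long start, long end) {
        ranges.put(start, new MethodRange(end, name));
    }

    // Look up the sampled address and, on a match, bump the method's counter.
    void recordSample(long address) {
        Map.Entry<Long, MethodRange> e = ranges.floorEntry(address);
        if (e != null && address < e.getValue().end) {
            counts.merge(e.getValue().name, 1L, Long::sum);
        }
    }

    // Sort methods by descending sample count and return the first n names.
    List<String> topN(int n) {
        List<String> names = new ArrayList<>(counts.keySet());
        names.sort((a, b) -> Long.compare(counts.get(b), counts.get(a)));
        return new ArrayList<>(names.subList(0, Math.min(n, names.size())));
    }
}
```

A sorted map is used in this sketch because a sampled address rarely equals a method's start address; floorEntry returns the range with the greatest start address not exceeding the sample, and the end-address check rejects samples that fall between methods.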

FIG. 6 is a logical block diagram illustrating the generation of a profile as part of optimized recompilation using hardware tracing, in one embodiment. As described above, tracing controller 170 may be configured to initiate hardware tracing of (at least a portion of) an executing application. In response, processor 110 may generate trace data based on the execution of the traced code. For example, processor 110 may generate one or more packets 610 of trace data and store them in trace buffer 190.

In some embodiments, the controller may, during tracing, trigger control flow reconstruction whenever the trace buffer is filled. For example, in one embodiment, trace buffer 190 may be, or may include, a circular trace buffer. Whenever the circular trace buffer wraps around, tracing controller 170 may initiate control flow reconstruction based on information within trace buffer 190 to generate reconstructed control flow 620. Tracing may be paused until the content of the buffer is consumed by the decoder. In some embodiments, tracing controller 170 may utilize decoder 180 to decode the information in trace buffer 190 when reconstructing a control flow.
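As a non-limiting illustration of the wrap-triggered decoding described above, the following Java sketch models a fixed-size buffer in software: when the write position reaches the end of the buffer, recording is paused, the buffered packets are handed to a decoder callback, and recording resumes from the start. The class WrapTriggeredBuffer, the representation of packets as integers, and the callback interface are assumptions made for this example and do not reflect the packet format or buffer interface of any particular processor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical software model of a trace buffer that is drained on wrap-around.
class WrapTriggeredBuffer {
    private final int[] packets;
    private final Consumer<List<Integer>> decoder;
    private int writeIndex = 0;
    private boolean tracingPaused = false;

    // capacity is assumed to be at least 1.
    WrapTriggeredBuffer(int capacity, Consumer<List<Integer>> decoder) {
        this.packets = new int[capacity];
        this.decoder = decoder;
    }

    // Stand-in for the processor writing one trace packet into the buffer.
    void emit(int packet) {
        if (tracingPaused) {
            return; // nothing is recorded while the decoder consumes the buffer
        }
        packets[writeIndex++] = packet;
        if (writeIndex == packets.length) {
            tracingPaused = true; // pause until the content is consumed
            List<Integer> full = new ArrayList<>();
            for (int p : packets) {
                full.add(p);
            }
            decoder.accept(full); // reconstruct control flow from the full buffer
            writeIndex = 0; // wrap around
            tracingPaused = false; // resume tracing
        }
    }

    // Number of packets written since the last wrap-around.
    int pending() {
        return writeIndex;
    }
}
```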

Tracing controller 170 may accumulate statistics about branch directions (i.e., number of times a branch instruction is executed and number of times it is taken) and/or call/jump targets (i.e., target addresses) as the decoder decodes the recorded trace (i.e., reconstructs control flow). The collected statistics may be included in, or used to generate, profile 150. Once the content of trace buffer 190 has been decoded and the corresponding control flow(s) reconstructed, tracing controller 170 may resume tracing. Once tracing of the executing code has been completed (e.g., when the traced method or other code being traced, but not necessarily the complete application, has finished execution) tracing controller 170 may post-process the profiles gathered, as described below according to various embodiments.
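As a non-limiting illustration of the statistics described above, the following Java sketch accumulates, for each branch address, the number of times the branch was executed and taken, and, for each indirect call or jump site, a count per target address. The class BranchProfile and its method names are assumptions introduced for this example; in an embodiment, the record calls would be driven by a decoder as it reconstructs control flow.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical accumulator for branch-direction and target statistics.
class BranchProfile {
    // Branch address -> {times executed, times taken}.
    private final Map<Long, long[]> branches = new HashMap<>();
    // Indirect call/jump site -> (target address -> count).
    private final Map<Long, Map<Long, Long>> targets = new HashMap<>();

    // Called for each decoded conditional branch outcome.
    void recordBranch(long address, boolean taken) {
        long[] c = branches.computeIfAbsent(address, k -> new long[2]);
        c[0]++;
        if (taken) {
            c[1]++;
        }
    }

    // Called for each decoded indirect call or jump.
    void recordIndirect(long site, long target) {
        targets.computeIfAbsent(site, k -> new HashMap<>())
               .merge(target, 1L, Long::sum);
    }

    // Fraction of executions in which the branch was taken; NaN if never seen.
    double takenProbability(long address) {
        long[] c = branches.get(address);
        return (c == null || c[0] == 0) ? Double.NaN : (double) c[1] / c[0];
    }

    // Number of times the given site transferred control to the given target.
    long targetCount(long site, long target) {
        Map<Long, Long> m = targets.get(site);
        return m == null ? 0L : m.getOrDefault(target, 0L);
    }
}
```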

Making use of the profiles collected through hardware binary tracing during re-compilation may, in some embodiments, require mapping the profiles to the application's source code. This in turn may require the compiler to propagate source positions throughout the entire compilation pipeline and/or associate the source positions with their corresponding instructions in the generated machine code. Thus, profiles 150 collected through binary tracing of application 130 may be mapped back to the source code positions in application 130.

After sufficient profiling data are collected (e.g., on average N instances per branch may be traced in a method according to some embodiments), tracing controller 170 may post-process the profiling data and decide whether recompilation is necessary for a given method. FIG. 7 is a logical block diagram illustrating the control flow for collecting, post-processing and utilizing hardware traces, according to various embodiments. After a method is compiled as in block 710, and selected for tracing as in block 730, hardware traces may be gathered as in block 740 while the method is being executed as in block 720. After sufficient profiling data (e.g., tracing information) are collected, tracing controller 170 may analyze the collected trace data, as in block 750. As described above, in some embodiments, tracing controller 170 may compare a newly generated profile with one or more previously generated profiles to determine whether the traced method (or, in general, any collection of traced code) is stable.

For example, in one embodiment, tracing controller 170 may determine that a recompilation of a traced method may be necessary if the profile used for compilation differs significantly from profiles previously gathered, such as interpretation-based profiles, instrumentation-based profiles and/or other hardware-tracing-based profiles. For example, if a branch was compiled with a 50% probability of being taken based on the interpreter profiles, but the hardware profiles indicate that it is taken all the time (probability=100%), this might be an indication that the branch behavior changes based on the context (i.e., the execution context, input data, etc.), and so the method might be worth recompiling with the recent profile. Comparing compile-time and run-time profiles collected through binary tracing may require maintaining profiling information used at compilation time for each source position, according to one embodiment.

In some embodiments, tracing controller 170 may decide whether to recompile a method by comparing compile-time and run-time profiles of relevant instructions (in some embodiments of all relevant instructions). If the majority of the compared profiles are similar (e.g., within some predetermined and/or configurable threshold), the method may not contain any context-dependent elements and/or there may have been no phase change since the method was last profiled. In this case, the controller may determine not to recompile the method and may mark the method as stable (e.g., so the method may not be traced anymore). Tracing controller 170 may then trace methods (e.g., from the hot methods list) that contribute less to the total execution time than methods already traced/profiled (e.g., less-hot or colder methods). However, if the majority of the compile-time and run-time profiles differ (i.e., differ more than a predetermined and/or configurable threshold), the method may be considered a good candidate for recompilation by the controller. Recompiling a method whose compile-time and run-time profiles are different may result in improved performance due to performing context-sensitive and phase-specific optimizations, according to some embodiments.
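As a non-limiting illustration of the majority comparison described above, the following Java sketch compares the taken probability recorded at compile time with the taken probability observed at run time for each branch and reports whether more than half of the compared branches differ by more than a threshold. The class RecompileDecision, the use of strings to identify source positions, and the use of an absolute difference in probability as the measure of dissimilarity are assumptions made for this example.

```java
import java.util.Map;

// Hypothetical recompilation decision based on a majority of differing branches.
class RecompileDecision {
    // Each map associates a source position (illustrative string key) with the
    // probability, between 0.0 and 1.0, that the branch at that position is taken.
    static boolean shouldRecompile(Map<String, Double> compileTime,
                                   Map<String, Double> runTime,
                                   double threshold) {
        int compared = 0;
        int differing = 0;
        for (Map.Entry<String, Double> e : compileTime.entrySet()) {
            Double observed = runTime.get(e.getKey());
            if (observed == null) {
                continue; // branch was not observed in the trace; skip it
            }
            compared++;
            if (Math.abs(observed - e.getValue()) > threshold) {
                differing++;
            }
        }
        // Recompile only when a strict majority of compared branches differ.
        return compared > 0 && 2 * differing > compared;
    }
}
```

For instance, with a threshold of 0.2, a branch compiled with a 50% taken probability and observed as taken 100% of the time would count as differing in this sketch, while a branch observed at 55% would not.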

As noted above, hardware tracing may provide context-sensitive profiling of methods that are already inlined in their respective caller method.

To provide context sensitivity, prior work has proposed profile collection mechanisms a) through instrumentation in the less-optimized machine code compiled in early stages in a multi-tier compilation pipeline and b) in the interpreter. Despite their performance and code size reduction benefits thanks to context sensitivity, these prior mechanisms suffer from the same drawbacks as their context-insensitive counterparts. First, because these profiling mechanisms are software based, they slow down the execution of the main application due to the extra instructions that need to be executed for profiling. Second, the interpreter-based approach is language specific and each language interpreter must implement its own profiler even if multiple languages can use a single compiler.

The following example Java™ code listing illustrates, according to one example embodiment, where context insensitive profiling may result in missing optimization opportunities that may be recognized by context sensitive profiling, such as may be implemented by one or more techniques for optimized recompilation using hardware tracing described herein according to various embodiments.

class BaseType {
    int key;
    public BaseType(int key) { this.key = key; }
}

class Type1 extends BaseType {
    public Type1(int key) { super(key); }
    public int hashCode() { /** Implementation **/ }
    public boolean equals(Object other) { /** Implementation **/ }
}

class Type2 extends BaseType {
    public Type2(int key) { super(key); }
    public int hashCode() { /** Implementation **/ }
    public boolean equals(Object other) { /** Implementation **/ }
}

class HashMapExample {
    static int counter = 0;

    public static void main() {
        Type1 type1Object = new Type1(1);
        Type2 type2Object = new Type2(2);
        HashMap<BaseType, Integer> map = new HashMap<>();
        map.put(type1Object, 1);
        map.put(type2Object, 2);
        addToCounter(map, type1Object); /** Call-site 1 (CS1) **/
        addToCounter(map, type2Object); /** Call-site 2 (CS2) **/
    }

    public static void addToCounter(HashMap<BaseType, Integer> map, BaseType object) {
        counter += map.get(object).intValue();
    }
}

class HashMap {
    // Simplified HashMap.get
    public Object get(Object key) {
        int index = maskIndex(key.hashCode());
        HashMapEntry entry = elementData[index];
        while (entry != null) {
            // ...
            return entry;
        }
        return null;
    }
}

For example, in the example code snippet listed above (according to one example embodiment), the get method of HashMap is reached through two call sites of the addToCounter method (CS1 and CS2), and each call site calls first addToCounter and then HashMap.get with an instance of a different type, Type1 or Type2, as the argument. In a system performing context-insensitive profiling, as illustrated in FIG. 8A, the type profiles collected for the invocation of hashCode in the get method would be 50% for both Type1.hashCode 810 and Type2.hashCode 820, such as because the profiler may not keep track of the call contexts of HashMap.get 800. Thus, because the probabilities of the two types are equal, the inliner might inline either both hashCode methods or neither, as there is no winner case (e.g., neither of the two types is more common).

Unlike context-insensitive profilers, a context-sensitive profiler, such as one configured to implement one or more of the techniques for optimized recompilation using hardware tracing described herein, may differentiate between the two call sites of HashMap.get 840 and 870 in addToCounter and may associate the profiles with each call site separately, as illustrated in FIG. 8B. As shown in the following example source code listing according to one embodiment, the type profile for the hashCode call in get would be 100% of Type1 850 for Call Site 1 830 and 100% of Type2 880 for Call Site 2 860. In this case, because there is a clear winner at both call sites, an inliner (e.g., of compiler 160) may inline the correct method at each call site, that is, Type1.hashCode 850 in addToCounter.CS1 830 and Type2.hashCode 880 in addToCounter.CS2 860.
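As a non-limiting illustration of the difference between the two kinds of profile, the following Java sketch counts receiver types per profiling context, where the context is an arbitrary string key. The class ReceiverTypeProfile and the context strings used in the usage note are assumptions made for this example: keying every observation by the callee alone models a context-insensitive profile, while including the call site in the key models a context-sensitive profile.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical receiver-type profile keyed by a caller-chosen context string.
class ReceiverTypeProfile {
    // Context -> (receiver type name -> number of observations).
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Record one observed receiver type for a call in the given context.
    void record(String context, String receiverType) {
        counts.computeIfAbsent(context, k -> new HashMap<>())
              .merge(receiverType, 1, Integer::sum);
    }

    // Fraction of observations in the context that had the given receiver type.
    double share(String context, String receiverType) {
        Map<String, Integer> m = counts.get(context);
        if (m == null) {
            return 0.0;
        }
        int total = 0;
        for (int c : m.values()) {
            total += c;
        }
        return total == 0 ? 0.0 : (double) m.getOrDefault(receiverType, 0) / total;
    }
}
```

Recording Type1 and Type2 once each under the single context "HashMap.get" yields a share of 0.5 for each type, mirroring FIG. 8A, whereas recording Type1 under "CS1/HashMap.get" and Type2 under "CS2/HashMap.get" yields a share of 1.0 for the corresponding type at each call site, mirroring FIG. 8B.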

While described herein as being implemented and/or performed on stand-alone computer systems, optimized recompilation using hardware tracing described herein may also be implemented using resources of a cloud computing environment, as illustrated in FIG. 9, which illustrates a cloud computing environment configured to implement the methods, mechanisms and/or techniques described herein according to one embodiment. For example, cloud infrastructure system 930 may provide a variety of services via a cloud or network environment. These services may include one or more services provided under Software as a Service (SaaS) category, Platform as a Service (PaaS) category, Infrastructure as a Service (IaaS) category, or other categories of services including hybrid services.

As shown in FIG. 9, cloud infrastructure system 930 may comprise multiple components, which, working in conjunction, enable provision of services provided by cloud infrastructure system 930. In some embodiments, cloud infrastructure system 930 may include a SaaS platform 950, a PaaS platform 960, an IaaS platform 970, and cloud management functionality 940. These components may be implemented in hardware, or software, or combinations thereof, according to various embodiments.

The provided services may be accessible over network 920 (e.g., the Internet) by one or more customers via client device(s) 910. Thus, in some embodiments, cloud infrastructure system 930 may include, host, provision and/or provide any and/or all of processor(s) 110, execution environment 120, application 130, compiler 160, tracing controller 170, and/or decoder 180 to customers over network 920. Thus, in some embodiments, optimized recompilation using hardware tracing, including the methods, mechanisms and/or techniques described herein, may be implemented and/or provided as part of a SaaS platform, PaaS platform and/or IaaS platform offered by a cloud infrastructure system.

In addition to recording the flow of control, a binary execution trace mechanism may be configured to annotate the trace with timing information (e.g., high-precision and/or cycle-accurate timing information). Two types of information may be available: (i) time, such as may be measured by a real-time clock, and/or (ii) the ratio of the core's clock cycle to the bus clock cycle. A record including a new ratio may be emitted into the trace whenever the ratio changes, according to some embodiments. A record of dynamic frequency changes, such as may be caused by dynamically switched processor clock speeds (e.g., in response to thermal events), may be utilized to account for performance variations, and may make it easier to identify variations due to other effects, such as cache hits and misses. Additionally, in some embodiments high-precision timed traces may enable more detailed regression analysis.

Thus, optimized recompilation using hardware tracing may provide extensive, as well as context-sensitive, profiling data for an executing application, and may deliver application performance improvements, such as by providing improved guidance of compiler optimizations, according to various embodiments.

Computing System Example

The systems and methods described herein may be implemented on or by any of a variety of computing systems, in different embodiments. FIG. 10 illustrates a computing system 1000 that is configured to implement the disclosed techniques, according to various embodiments. The computer system 1000 may be any of various types of devices, including, but not limited to, a personal computer system, desktop computer, laptop or notebook computer, mainframe computer system, cloud computing system, handheld computer, workstation, network computer, a consumer device, application server, storage device, a peripheral device such as a switch, modem, router, etc., or in general any type of computing device.

The mechanisms for implementing the techniques described herein may be provided as a computer program product, or software, that may include a non-transitory, computer-readable storage medium having stored thereon instructions, which may be used to program a computer system 1000 (or other electronic devices) to perform a process according to various embodiments. A computer-readable storage medium may include any mechanism for storing information in a form (e.g., software, processing application) readable by a machine (e.g., a computer). The machine-readable storage medium may include, but is not limited to, magnetic storage medium (e.g., floppy diskette); optical storage medium (e.g., CD-ROM); magneto-optical storage medium; read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; electrical, or other types of medium suitable for storing program instructions. In addition, program instructions may be communicated using optical, acoustical or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.).

In various embodiments, computer system 1000 may include one or more processors 1070, such as one or more tracing capable processors 110, each of which may include multiple cores and any of which may be single or multi-threaded. For example, multiple processor cores may be included in a single processor chip (e.g., a single processor 1070), and multiple processor chips may be included in computer system 1000. The computer system 1000 may also include one or more persistent storage devices 1050 (e.g., optical storage, magnetic storage, hard drive, tape drive, solid state memory, etc.) and one or more system memories 1010 (e.g., one or more of cache, SRAM, DRAM, RDRAM, EDO RAM, DDR 10 RAM, SDRAM, Rambus RAM, EEPROM, etc.). Various embodiments may include fewer or additional components not illustrated in FIG. 10 (e.g., video cards, audio cards, additional network interfaces, peripheral devices, a network interface such as an ATM interface, an Ethernet interface, a Frame Relay interface, etc.).

The one or more processors 1070, the storage device(s) 1050, and the system memory 1010 may be coupled to the system interconnect 1040. One or more of the system memories 1010 may contain program instructions 1020. Program instructions 1020 may be executable to implement one or more applications 1022, such as application 130, compiler 160, tracing controller 170 and/or decoder 180. Program instructions 1020 may be executable to implement shared libraries 1024, and/or operating systems 1026. In some embodiments, program instructions 1020 may include a compiler 1028, such as compiler 160. In some embodiments, compiler 1028 may be an optimizing compiler that is configured to apply one or more transformations and/or optimizations to application or library code that is executable to implement the disclosed methods, techniques and/or mechanisms.

Program instructions 1020 may be encoded in platform native binary, any interpreted language such as Java™ byte-code, or in any other language such as C/C++, Java™, etc. or in any combination thereof. In various embodiments, compiler 1028, applications 1022, operating system 1026, and/or shared libraries 1024 may each be implemented in any of various programming languages or methods. For example, in one embodiment, compiler 1028 and operating system 1026 may be Java based, while in another embodiment they may be written using the C or C++ programming languages. Similarly, applications 1022 may be written using Java, C, C++, or another programming language, according to various embodiments. Moreover, in some embodiments, compiler 1028, applications 1022, operating system 1026, and/or shared libraries 1024 may not be implemented using the same programming language. For example, applications 1022 may be Java based, while compiler 1028 may be developed using C or C++.

The program instructions 1020 may include operations, or procedures, and/or other processes for implementing the techniques described herein. Such support and functions may exist in one or more of the shared libraries 1024, operating systems 1026, compiler 1028, applications 1022, compiler 160, tracing controller 170, and/or decoder 180, in various embodiments.

Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. For example, although many of the embodiments are described in terms of particular types of lock structures, policies, and procedures, it should be noted that the techniques and mechanisms disclosed herein may be applicable in other contexts in which critical sections of code and/or shared resources may be protected by other types of locks/structures under different policies or procedures than those described in the examples herein. It is intended that the following claims be interpreted to embrace all such variations and modifications.

What is claimed:
 1. A computing device, comprising: one or more processors configured to generate hardware tracing information for an application executing on the one or more processors; and a memory comprising program instructions, that when executed on the one or more processors cause the processors to implement a tracing controller configured to: select, to define a portion of the application to trace, between (a) an entirety of the application and (b) one or more methods of a plurality of methods of the application, wherein the plurality of methods comprises the one or more methods and other methods, and wherein the selecting is based at least in part on respective contributions to a total execution time of the application by individual ones of the plurality of methods; instruct the one or more processors to initiate hardware tracing of the selected portion of the application; generate a new profile of the selected portion based on the hardware tracing information; determine whether the new profile differs more than a threshold amount from a previously generated profile for the selected portion; and in response to determining that the new profile differs more than the threshold amount from the previously generated profile, initiate re-compilation of the selected portion.
 2. The system of claim1, wherein the tracing controller is further configured to select basedat least in part on one or more of: respective contributions to a totalinstruction count of the application by individual ones of the pluralityof methods.
 3. The system of claim 1, wherein in response to determiningthat the new profile does not differ more than the threshold amount fromthe previously generated profile, the tracing controller is furtherconfigured to: instruct the one or more processors to discontinuehardware tracing of the selected portion; and instruct the one or moreprocessors to initiate hardware tracing of another portion of theapplication.
 4. The system of claim 3, wherein to instruct the one ormore processors to initiate hardware tracing of another portion of theapplication, the tracing controller is configured to: select whether toinitiate tracing of another single execution loop of another method oran entirety of the other method; and wherein in response to selecting toinitiate tracing of the entirety of the other method, the other portioncomprises the entirety of the other method and in response to selectingto initiate tracing of the other single execution loop, the otherportion comprises the other single execution loop but less than theentirety of the other method.
 5. The system of claim 1, wherein thepreviously generated profile is generated by an interpreter prior to thetracing controller instructing the one or more processors to initiatehardware tracing of the selected portion.
 6. The system of claim 1,wherein the previously generated profile is generated based on hardwaretracing information generated prior to generating the hardware tracinginformation upon which the new profile is based.
 7. The system of claim1, wherein to instruct the one or more processors to initiate tracing ofthe selected portion, the tracing controller is further configured toinstruct the one or more processors to generate trace information forapplication code executing within one or more address rangescorresponding to the selected portion.
 8. The system of claim 1, wherein the hardware tracing information comprises one or more of: an outcome of one or more conditional branches within the portion; a target address of an indirect jump within the portion; a target address of an indirect call within the portion; and a source and target of an asynchronous control transfer within the portion.
 9. The system of claim 1, wherein the new profile comprises statistics regarding branch directions, wherein the statistics comprise: a number of times a conditional branch instruction is executed within the portion; and a number of times the conditional branch instruction is taken.
 10. A computer-implemented method, comprising: selecting, to define a traced portion of an application executing on one or more processors to trace, between (a) an entirety of the application and (b) one or more methods of a plurality of methods of the application, wherein the plurality of methods comprises the one or more methods and other methods, wherein the one or more processors are configured to generate hardware tracing information for application code executing on the one or more processors, and wherein the selecting is based at least in part on respective contributions to a total execution time of the application by individual ones of the plurality of methods; initiating hardware tracing for the selected portion executing on the one or more processors; generating a new profile of the selected portion based on the hardware tracing information; determining whether the new profile differs more than a threshold amount from a previously generated profile for the selected portion; and re-compiling the selected portion in response to determining that the new profile differs more than the threshold amount from the previously generated profile.
 11. The method of claim 10, wherein selecting is further based at least in part on: the respective contributions to a total instruction count of the application by individual ones of the plurality of methods.
 12. The method of claim 10, further comprising, in response to determining that the new profile does not differ more than the threshold amount from the previously generated profile: discontinuing hardware tracing of the selected portion; and initiating hardware tracing of another portion of the application.
 13. The method of claim 12, wherein said initiating hardware tracing of another portion of the application comprises: selecting whether to initiate tracing of another single execution loop of another method of the application or an entirety of the other method; and wherein in response to selecting to initiate tracing of the entirety of the other method, the other portion comprises the entirety of the other method and in response to selecting to initiate tracing of the other single execution loop, the other portion comprises the other single execution loop but less than the entirety of the other method.
 14. The method of claim 10, wherein the previously generated profile is generated by an interpreter prior to said initiating hardware tracing of the selected portion.
 15. The method of claim 10, wherein the previously generated profile is generated based on previous hardware tracing information generated prior to the hardware tracing information upon which the new profile is based.
 16. The method of claim 10, wherein said initiating hardware tracing of the selected portion comprises instructing the one or more processors to generate trace information for application code executing within one or more address ranges corresponding to the selected portion.
 17. One or more non-transitory, computer-readable storage media storing program instructions that when executed on or across one or more processors cause the one or more processors to perform: selecting, to define a portion of an application executing on one or more processors to trace, between (a) an entirety of the application and (b) one or more methods of a plurality of methods of the application, wherein the plurality of methods comprises the one or more methods and other methods, wherein the one or more processors are configured to generate hardware tracing information for application code executing on the one or more processors, and wherein the selecting is based at least in part on respective contributions to a total execution time of the application by individual ones of the plurality of methods; initiating hardware tracing for the selected portion executing on the one or more processors; generating a new profile of the selected portion based on the hardware tracing information; determining whether the new profile differs more than a threshold amount from a previously generated profile for the selected portion; and re-compiling the selected portion in response to determining that the new profile differs more than the threshold amount from the previously generated profile.
 18. The one or more non-transitory, computer-readable storage media of claim 17, wherein selecting is based at least in part on: the respective contributions to a total instruction count of the application by individual ones of the plurality of methods.
 19. The one or more non-transitory, computer-readable storage media of claim 17, wherein in response to determining that the new profile does not differ more than the threshold amount from the previously generated profile, the program instructions cause the computing device to perform: discontinuing hardware tracing of the selected portion; and initiating hardware tracing of another portion of the application.
 20. The one or morenon-transitory, computer-readable storage media of claim 19, whereinsaid instructing the one or more processors to initiate hardware tracingof another portion of the application, comprises: selecting whether toinitiate tracing of another single execution loop of another method ofthe application or an entirety of the other method; and wherein inresponse to selecting to initiate tracing of the entirety of the othermethod, the other portion comprises the entirety of the other method andin response to selecting to initiate tracing of the other singleexecution loop, the other portion comprises the other single executionloop but less than the entirety of the other method.